A common use for repositories is to manage files (among other things), so ModeShape 3 can handle extremely large binary values, even ones larger than available memory. This is possible because ModeShape never loads the whole value onto the heap, but instead streams the value to and from the persistent store. You can also configure where ModeShape stores binary values independently of where the rest of the content is stored.
How it works
The key to understanding how ModeShape manages the Binary values is to remember how the JCR API exposes them. To set a property to a Binary value, the JCR client creates the javax.jcr.Binary instance from the binary stream:
javax.jcr.Session session = ...
javax.jcr.ValueFactory factory = session.getValueFactory();
// Create the binary value ...
java.io.InputStream stream = ...
javax.jcr.Binary binary = factory.createBinary(stream);
// Use the binary value ...
javax.jcr.Property property = ...
property.setValue(binary);
// Save the changes ...
session.save();
Then, to access the binary content, the JCR client gets the property, gets the binary value(s), and then obtains the binary value's InputStream:
javax.jcr.Property property = ...
javax.jcr.Binary binary = property.getValue().getBinary();
java.io.InputStream stream = binary.getStream();
// Use the stream ...
When ModeShape creates the actual javax.jcr.Binary value, it reads the supplied java.io.InputStream and immediately stores the content in the repository's binary storage area, which then returns a Binary instance that contains a pointer to the persisted binary content. When needed, that Binary instance (or another one obtained at a later time) obtains from the binary storage area the InputStream for the content and simply returns it.
Note that the same Binary value can be read from one property and set on any other properties:
javax.jcr.Session session = ...
// Get the Binary value from one property ...
javax.jcr.Property property = ...
javax.jcr.Binary binary = property.getValue().getBinary();
// And set it as the value for other properties ...
javax.jcr.Property otherProperty = ...
otherProperty.setValue(binary);
// Save the changes ...
session.save();
This works because the Binary value contains only the pointer to the binary content, so copying or reusing Binary objects is very efficient and lightweight. It also works because of what ModeShape uses for the pointers.
ModeShape stores all binary content by its SHA-1 hash. The SHA-1 cryptographic hash function is not used for security purposes, but is instead used because the SHA-1 can reliably be determined entirely from the content itself, and because two binary contents will, for all practical purposes, have the same SHA-1 only if they are identical. Thus, the SHA-1 hash of some binary content serves as an excellent key for storing and referencing that content. The pointer we mentioned in the previous paragraph is merely the SHA-1 of the binary content. The following diagram represents how this works:
Using the SHA-1 hash as the identifier for the binary content also means that ModeShape never needs to store a given binary content more than once, no matter how many nodes or properties refer to it. It also means that if your JCR client already knows (or can compute) the SHA-1 of a large value, the JCR client can use ModeShape-specific APIs to easily determine if that value has already been stored in the repository. (We'll see an example of this later on.)
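As a rough illustration of how a client could compute the SHA-1 of a large value without loading it into memory, the following sketch streams the content through the JDK's MessageDigest in fixed-size chunks. The class name Sha1KeySketch and the method sha1Hex are hypothetical and are not part of ModeShape; the code uses only standard Java APIs and is not ModeShape's own implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of ModeShape.
final class Sha1KeySketch {

    /** Computes the hexadecimal SHA-1 of a stream, reading it in 8 KB chunks. */
    static String sha1Hex(InputStream stream) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = stream.read(buffer)) != -1) {
            // Only the current chunk is held in memory
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] content = "some file content".getBytes(StandardCharsets.UTF_8);
        // Identical content always yields the same key, so it would need to be stored only once
        String first = sha1Hex(new ByteArrayInputStream(content));
        String second = sha1Hex(new ByteArrayInputStream(content.clone()));
        System.out.println(first.equals(second));
    }
}
```

Because the key depends only on the bytes, two streams with identical content produce the same 40-character hexadecimal string regardless of where the content came from.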
Extended Binary interface
The ModeShape public API defines the org.modeshape.jcr.api.Binary interface as a simple extension to the standard javax.jcr.Binary interface. ModeShape's extension adds useful methods to get the SHA-1 hash (as a byte array and as a hexadecimal string) and the MIME type for the content:
@Immutable
public interface Binary extends javax.jcr.Binary {
/**
* Get the SHA-1 hash of the contents. This hash can be used to determine whether two
* Binary instances contain the same content.
*
* Repeatedly calling this method should generally be efficient, as most implementations
* will compute the hash only once.
*
* @return the hash of the contents as a byte array, or an empty array if the hash could
* not be computed.
* @see #getHexHash()
*/
byte[] getHash();
/**
* Get the hexadecimal form of the SHA-1 hash of the contents. This hash can be used to
* determine whether two Binary instances contain the same content.
*
* Repeatedly calling this method should generally be efficient, as most implementations
* will compute the hash only once.
*
* @return the hexadecimal form of the getHash(), or a null string if the hash could
* not be computed or is not known
* @see #getHash()
*/
String getHexHash();
/**
* Get the MIME type for this binary value.
*
* @return the MIME type, or null if it cannot be determined (e.g., the Binary is empty)
* @throws IOException if there is a problem reading the binary content
* @throws RepositoryException if an error occurs.
*/
String getMimeType() throws IOException, RepositoryException;
/**
* Get the MIME type for this binary value.
*
* @param name the name of the binary value, useful in helping to determine the MIME type
* @return the MIME type, or null if it cannot be determined (e.g., the Binary is empty)
* @throws IOException if there is a problem reading the binary content
* @throws RepositoryException if an error occurs.
*/
String getMimeType( String name ) throws IOException, RepositoryException;
}
All javax.jcr.Binary values returned by ModeShape will implement this public interface, so feel free to cast the values to gain access to the additional methods.
Importing and Exporting
When exporting content from a workspace with large Binary values, be sure to export using JCR's System View format. Only the System View treats properties as child elements. This allows each large value to be streamed (using buffered streams) into the XML element's content as a Base64-encoded string. Importing can also take advantage of streaming.
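To make the streaming idea concrete, here is a minimal sketch of Base64-encoding a stream chunk by chunk with the JDK's java.util.Base64, so that the whole value never has to exist in memory as a single String. The class Base64StreamingSketch and its encode method are hypothetical names for illustration only; this is not the code ModeShape uses for System View export.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of ModeShape.
final class Base64StreamingSketch {

    /** Copies 'in' to 'out' as Base64 text, holding only one buffer of content at a time. */
    static void encode(InputStream in, OutputStream out) throws IOException {
        // wrap() encodes on the fly; closing the wrapper writes the final padding
        // and also closes the underlying 'out' stream
        try (OutputStream encoder = Base64.getEncoder().wrap(out)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                encoder.write(buffer, 0, read);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        encode(new ByteArrayInputStream("binary content".getBytes(StandardCharsets.UTF_8)), out);
        System.out.println(out.toString("US-ASCII"));
    }
}
```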
Exporting content using JCR's Document View results in all properties being treated as XML attributes, and various XML processing libraries treat large attributes poorly (e.g., using values that are in-memory String objects). Another critical disadvantage of the Document View is that it is unable to represent multi-valued properties, since attributes can have only one value.
Implementation design
This section describes the internal design of how ModeShape stores binary values, and is typically useful to either understand the nuances of the various configuration choices or to implement custom binary stores.
None of the interfaces described in this section are part of the public API, and should never be directly used by JCR client applications.
BinaryValue
In addition to the ModeShape-specific org.modeshape.jcr.api.Binary extension, ModeShape also defines an org.modeshape.jcr.value.BinaryValue interface that adds several other features required to properly persist and manage Binary values. These other features are part of ModeShape's internal design and are therefore not appropriate for inclusion in the public API. Specifically, BinaryValue instances are themselves immutable, they have an immutable BinaryKey that is a comparable representation of the SHA-1 hash, they are comparable with each other (based upon their keys), they can be serialized, and the getSize() method does not throw an exception like the standard method:
@Immutable
public interface BinaryValue extends Comparable<BinaryValue>, Serializable, org.modeshape.jcr.api.Binary {
/**
* Get the length of this binary data.
*
* Note that this method, unlike the standard {@link javax.jcr.Binary#getSize()} method,
* does not throw an exception.
*
* @return the number of bytes in this binary data
*/
@Override
public long getSize();
/**
* Get the key for the binary value.
*
* @return the key; never null
*/
public BinaryKey getKey();
}
BinaryStore
The ModeShape-specific BinaryStore interface is thus defined to use the internal BinaryValue interface:
@ThreadSafe
public interface BinaryStore {
/**
* Initialize the store and get ready for use.
*/
public void start();
/**
* Shuts down the store.
*/
public void shutdown();
/**
* Get the minimum number of bytes that a binary value must contain before it can
* be stored in the binary store.
* @return the minimum number of bytes for a stored binary value; never negative
*/
long getMinimumBinarySizeInBytes();
/**
* Set the minimum number of bytes that a binary value must contain before it can
* be stored in the binary store.
* @param minSizeInBytes the minimum number of bytes for a stored binary value; never negative
*/
void setMinimumBinarySizeInBytes( long minSizeInBytes );
/**
* Set the text extractor that can be used for extracting text from binary content.
* @param textExtractor the text extractor
*/
void setTextExtractor( TextExtractor textExtractor );
/**
* Set the MIME type detector that can be used for determining the MIME type for binary content.
* @param mimeTypeDetector the detector; never null
*/
void setMimeTypeDetector( MimeTypeDetector mimeTypeDetector );
/**
* Store the binary value and return the JCR representation. Note that if the binary
* content in the supplied stream is already persisted in the store, the store may
* simply return the binary value referencing the existing content.
*
* @param stream the stream containing the binary content to be stored; may not be null
* @return the binary value representing the stored binary value; never null
* @throws BinaryStoreException if there is a problem storing the content
*/
BinaryValue storeValue( InputStream stream ) throws BinaryStoreException;
/**
* Get an InputStream to the binary content with the supplied key.
*
* @param key the key to the binary content; never null
* @return the input stream through which the content can be read
* @throws BinaryStoreException if there is a problem reading the content from the store
*/
InputStream getInputStream( BinaryKey key ) throws BinaryStoreException;
/**
* Mark the supplied binary keys as unused, but keep them in quarantine until needed again
* (at which point they're removed from quarantine) or until
* removeValuesUnusedLongerThan(long, TimeUnit) is called. This method ignores any keys for
* values not stored within this store.
*
* Note that the implementation must *never* block.
*
* @param keys the keys for the binary values that are no longer needed
* @throws BinaryStoreException if there is a problem marking any of the supplied
* binary values as unused
*/
void markAsUnused( Iterable<BinaryKey> keys ) throws BinaryStoreException;
/**
* Remove binary values that have been unused for at least the specified amount of time.
*
* Note that the implementation must *never* block.
*
* @param minimumAge the minimum time that a binary value has been unused before it can be
* removed; must be non-negative
* @param unit the time unit for the minimum age; may not be null
* @throws BinaryStoreException if there is a problem removing the unused values
*/
void removeValuesUnusedLongerThan( long minimumAge,
TimeUnit unit ) throws BinaryStoreException;
/**
* Get the text that can be extracted from this binary content.
*
* @param binary the binary content; may not be null
* @return the extracted text, or null if none could be extracted
* @throws BinaryStoreException if the binary content could not be accessed
*/
String getText( BinaryValue binary ) throws BinaryStoreException;
/**
* Get the MIME type for this binary value.
*
* @param binary the binary content; may not be null
* @param name the name of the content, useful for determining the MIME type;
* may be null if not known
* @return the MIME type, or null if it cannot be determined (e.g., the Binary is empty)
* @throws IOException if there is a problem reading the binary content
* @throws RepositoryException if an error occurs.
*/
String getMimeType( BinaryValue binary,
String name ) throws IOException, RepositoryException;
/**
* Obtain an iterable implementation containing all of the store's binary keys. The resulting iterator may be lazy, in the
* sense that it may determine additional {@link BinaryKey}s only as the iterator is used.
*
* @return the iterable set of binary keys; never null
* @throws BinaryStoreException if anything unexpected happens.
*/
Iterable<BinaryKey> getAllBinaryKeys() throws BinaryStoreException;
}
Each BinaryStore implementation must provide a no-arg constructor, and its member fields can be configured via the repository configuration. Note that the BinaryStore implementation must also implement several setter methods. The repository calls these when the BinaryStore is initialized, and may call them again at any time after that (when the repository configuration changes).
Minimum binary size
When the BinaryStore is initialized, the repository will use the setMinimumBinarySizeInBytes(...) method to specify the minimum size of the BinaryValues that must be persisted within the BinaryStore. Any binary content smaller than this can either be represented with InMemoryBinaryValue instances (meaning it will be persisted with the property where it is used) or be persisted in the BinaryStore. Note that if the repository's configuration changes, the repository may set a different minimum size threshold.
Minimum string size
The repository can also use the BinaryStore to store large string values. Any strings larger than the threshold set in the repository configuration will be stored in the BinaryStore and referenced in the node. Note that there is nothing to configure in the BinaryStore itself.
MIME type detection
When the BinaryStore is initialized, the repository will use the setMimeTypeDetector(...) method to give the BinaryStore a MimeTypeDetector instance it can use to determine the MIME type for any binary content. The BinaryStore is free to determine the MIME type at any time, including when the binary content is stored or only when the MIME type is needed (via the getMimeType(...) method). The BinaryStore is also free to persist this information, since binary content for a given SHA-1 never changes. Note that if the repository's configuration changes, the repository may set a different MIME type detector.
Garbage collection
There are a number of ways in which the BinaryStore may contain binary content (keyed by the SHA-1) that is no longer used or referenced. The first is when a JCR client or the repository removes the last Property containing the Binary. A second case is when a JCR client uses a Session to create a javax.jcr.Binary value and then clears the transient state (before the Session's transient state is saved). Neither of these poses a problem, since the minimum requirement is that the BinaryStore contain at least the content that is referenced in the repository content. However, all unused binary content in the BinaryStore takes up storage space, so ModeShape defines a way for the repository and the BinaryStore to recover that unused storage.
The repository periodically runs a multi-phase garbage collection process to identify those binaries that are no longer referenced by repository content. When such binaries are discovered, the repository calls the BinaryStore's markAsUnused(...) method. The BinaryStore then quarantines the binaries; if any quarantined binaries are used again, the BinaryStore can remove them from quarantine. The repository then periodically calls the BinaryStore's removeValuesUnusedLongerThan(...) method to purge all binaries that have been quarantined for at least the specified period of time.
The quarantine approach means that when BinaryValues are removed, there is a period of time during which they can be reused without the expensive removal and re-adding of the binary content.
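The quarantine bookkeeping described in this section can be sketched with a simple in-memory map from key to the time it was marked unused. This is only an illustration under stated assumptions: the class QuarantineSketch, the use of plain String keys instead of BinaryKey, the explicit nowMillis parameters, and the markAsUsed method are all hypothetical, and a real BinaryStore would also delete the stored content when purging.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of quarantine bookkeeping, not ModeShape's implementation.
final class QuarantineSketch {

    // Maps each quarantined key to the time (in milliseconds) at which it was marked unused
    private final Map<String, Long> unusedSince = new ConcurrentHashMap<>();

    /** Puts keys into quarantine; keys already quarantined keep their original timestamp. */
    void markAsUnused(Iterable<String> keys, long nowMillis) {
        for (String key : keys) {
            unusedSince.putIfAbsent(key, nowMillis);
        }
    }

    /** Takes a key out of quarantine because its content is referenced again. */
    void markAsUsed(String key) {
        unusedSince.remove(key);
    }

    /** Purges keys quarantined for at least the given age and returns them. */
    List<String> removeValuesUnusedLongerThan(long minimumAge, TimeUnit unit, long nowMillis) {
        long cutoff = nowMillis - unit.toMillis(minimumAge);
        List<String> purged = new ArrayList<>();
        unusedSince.forEach((key, since) -> {
            // Remove only if the timestamp is unchanged, so a key that was
            // re-quarantined in the meantime is left alone
            if (since <= cutoff && unusedSince.remove(key, since)) {
                purged.add(key);
            }
        });
        return purged;
    }

    boolean isQuarantined(String key) {
        return unusedSince.containsKey(key);
    }

    public static void main(String[] args) {
        QuarantineSketch store = new QuarantineSketch();
        store.markAsUnused(Arrays.asList("key-1", "key-2"), 0L);
        store.markAsUsed("key-2"); // reused before the purge, so it survives
        System.out.println(store.removeValuesUnusedLongerThan(1, TimeUnit.HOURS, TimeUnit.HOURS.toMillis(2)));
    }
}
```

Passing the current time in explicitly keeps the sketch deterministic; a real store would more likely read the clock itself.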
BinaryStore implementations
There are currently several implementations of BinaryStore:
- org.modeshape.jcr.value.binary.FileSystemBinaryStore - Stores each binary in a file on the file system, in a hierarchy of directories based upon the SHA-1 hash. The store uses Java's native OS file locks to prevent other processes from concurrently writing the files, and it also uses an internal set of locks to prevent multiple threads from simultaneously writing to the persisted files. This store exposes buffered FileInputStream instances that directly access the underlying files.
- org.modeshape.jcr.value.binary.InfinispanBinaryStore - Stores binary values within Infinispan, allowing the binary values to be chunked and distributed across the data grid (while the binary metadata is replicated across the grid). This option works really well for clustered topologies, since all processes in the cluster can access the same store. Two different caches are used: one for binary value metadata and one for the chunked values. Because the metadata for each value is very small (roughly 120 bytes), the metadata cache can be replicated, whereas the value cache can be replicated or distributed. Added in 3.1.0.Final
- org.modeshape.jcr.value.binary.MongodbBinaryStore - Stores binary values within a MongoDB instance, where the binary values are chunked and stored inside the database. It uses a local cache of binary values (backed by the file system store). Added in 3.1.0.Final
- org.modeshape.jcr.value.binary.DatabaseBinaryStore - Stores binary values within a JDBC database, where the binary values are stored as BLOBs in the underlying database. Added in 3.1.0.Final
- org.modeshape.jcr.value.binary.CassandraBinaryStore - Stores binary values within a Cassandra database, where the binary values are stored as BLOBs in the underlying database. Added in 3.4.0.Final
- org.modeshape.jcr.value.binary.TransientBinaryStore - A customization of the FileSystemBinaryStore that uses the system's temporary directory (as defined by java.io.tmpdir). Useful for testing or transient repositories only.
- org.modeshape.jcr.value.binary.CompositeBinaryStore - A binary store that aggregates several binary stores of the following types: file, infinispan, database or custom. Each nested binary store must have a unique name, under which it is aggregated by the composite store. When creating binary values, this name acts as a hint to the binary value factory, which uses it to decide which store receives the created value. To create binary values for this type of store, you must use the org.modeshape.jcr.api.ValueFactory interface and the public Binary createBinary( InputStream value, String hint ) method. Added in 3.3.0.Final
We would like to have other options, including storage in S3 and Hadoop. But it's also possible for developers using ModeShape to write their own implementations.
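For readers writing their own store, the "hierarchy of directories based upon the SHA-1 hash" used by the file system store can be pictured with the following sketch. The exact layout here (two directory levels taken from the first four hexadecimal characters) is an assumption chosen for illustration, and the class HashedPathSketch is hypothetical; check the FileSystemBinaryStore source for the layout ModeShape actually uses.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch; the directory depth and naming are assumptions, not ModeShape's layout.
final class HashedPathSketch {

    /** Maps a hexadecimal SHA-1 to a file path such as root/a9/99/a9993e36... */
    static Path pathFor(Path root, String hexHash) {
        if (hexHash.length() < 4) {
            throw new IllegalArgumentException("hash too short: " + hexHash);
        }
        // Spreading files across subdirectories avoids one huge flat directory
        return root.resolve(hexHash.substring(0, 2))
                   .resolve(hexHash.substring(2, 4))
                   .resolve(hexHash);
    }

    public static void main(String[] args) {
        Path root = Paths.get("target", "persistent_repository", "binaries");
        System.out.println(pathFor(root, "a9993e364706816aba3e25717850c26c9cd0d89d"));
    }
}
```

Deriving the path purely from the hash means the store needs no separate index to locate content: given a key, the file location follows directly.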
Configuring Binary Stores
If no explicit binary store configuration is present, the TransientBinaryStore implementation will be used by default. As explained above, this is not really suitable outside a testing context, as any binaries will be lost between restarts.
To explicitly configure a Binary Store in the repository JSON configuration file, add a binaryStorage section to the main storage section.
For example:
"storage" : {
.........
"binaryStorage" : {
"type" : "file",
"directory": "target/persistent_repository/binaries"
}
}
will configure a FileSystemBinaryStore while
"storage" : {
.........
"binaryStorage" : {
"type" : "database",
"driverClass" : "org.h2.Driver",
"url" : "jdbc:h2:mem:target/test/binary-store-db;DB_CLOSE_DELAY=-1",
"username" : "sa"
}
}
will configure a DatabaseBinaryStore.
The valid values for the type attribute are: file, database, transient, cache, composite and custom.
Regardless of the type, all binary stores support the following attributes:
minimumBinarySizeInBytes | the minimum size (in bytes) above which binary values will be stored in the store. Any binary value lower in size will be stored together with the other node information
minimumStringSize | the minimum length of a string above which all strings are stored in the binary store (as an optimization)
Besides these, each binary store type has its own list of custom attributes it supports. For more information about each possible value, see the repository schema.
Files and Folders
A very simple way of adding binary content into a repository is uploading files & folders.
The JCR specification defines the following node types:
[nt:folder] > nt:hierarchyNode
+ * (nt:hierarchyNode) version
[nt:file] > nt:hierarchyNode
+ jcr:content (nt:base) primary mandatory
[nt:resource] > mix:mimeType, mix:lastModified
- jcr:data (binary) primary mandatory
which means that a natural folder/file hierarchy would use an nt:folder/nt:file/jcr:content->jcr:data hierarchy.
ModeShape provides via the modeshape-jcr-api artifact a utility for creating such a hierarchy: org.modeshape.jcr.api.JcrTools#uploadFile. See the implementation for more information.